Analysis of Online News Popularity Dataset (https://archive.ics.uci.edu/ml/datasets/Online+News+Popularity): explore the statistical summaries of the features, visualize the attributes, and make conclusions from the visualizations and analysis
Describe the purpose of the dataset you selected (i.e., why was this data collected in the first place?). Describe how you would define and measure the outcomes from the dataset. That is, why is this data important and how do you know if you have mined useful knowledge from the dataset? How would you measure the effectiveness of a good prediction algorithm? Be specific *
This Online News Popularity Dataset was acquired from Mashable, a popular news and opinion website that focuses on social media and technology.
The dataset is available for download at the UCI Machine Learning Repository
The data consists of 39,644 records with 61 attributes that provide details and meta-data about online news articles published by Mashable over a two-year period (2013 - 2015). The goal of collecting this data was to predict the popularity of a news article. Popularity in this case is defined as the number of times the article is shared across all social media platforms.
Our group would like to create a classification model that can determine whether an article will be popular based on the number of times it is shared on social media. We have determined that an article can be classified as popular if it is shared at least 1,400 times. The model will have a binary output: popular / not popular.
We chose not to predict the actual number of shares an article will receive once published, since that would require complex machine learning algorithms that are outside the scope of this lab assignment. In addition, if the accuracy of the prediction model were low, it could provide publishers with false hope, such as predicting that their article will get 1,000 shares when in reality it only yields 10. The binary output, in contrast, doesn't promise something unrealistic but simply tells the author whether the article surpasses the chosen cutoff value of 1,400 shares.
This data is important because it can be used to help Mashable and other online publishers understand the factors that influence how popular their articles are. With a reliable model for predicting popularity, Mashable can learn how to design articles to achieve maximum popularity and exposure, which is the key objective of any publishing company.
We will measure the effectiveness of our algorithm by the accuracy of its classification. We feel that a model that achieves a success rate of 73% would qualify as an effective model. This success rate is based on the study carried out by K. Fernandes et al. that outlines the initial data collection, pre-processing, and prediction models for this data set.
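As a concrete (and purely illustrative) sketch of how classification accuracy is computed, using made-up labels and predictions rather than any model output:

```python
import numpy as np

# Hypothetical true labels and model predictions for 10 articles (not real data)
y_true = np.array([True, False, True, True, False, False, True, False, True, False])
y_pred = np.array([True, False, False, True, False, True, True, False, True, False])

# Accuracy = fraction of articles whose popular/not-popular label was predicted correctly
accuracy = (y_true == y_pred).mean()
print(accuracy)  # 0.8, i.e. 8 of the 10 articles classified correctly
```

Under our criterion, a model would be considered effective if this number reached roughly 0.73 on held-out articles.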
To keep this Jupyter notebook organized, we decided to import the necessary Python libraries all together at the beginning rather than doing so separately as we move through each section. The following libraries will be used for this lab assignment:
# Import general libraries which will be used for this Lab_01 project
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
Describe the meaning and type of data (scale, values, etc.) for each attribute in the data file.
In the code below we import the data set starting with the third variable, as the first two variables are non-predictive and will not help in our analysis.
Those variables were url and timedelta.
# Read csv file
df = pd.read_csv("OnlineNewsPopularity.csv")
# Exclude url and timedelta columns, read from n_tokens_title
df = df.loc[:, ' n_tokens_title':]
dfCopy = df.copy()  # Keep an untouched copy; dfCopy.info() summarizes it, df.tail() reads from the bottom
The code below generates a table of the variables remaining in the dataset. For each attribute we list its name, type, scale, and a short description.
from IPython.display import display, HTML
variables_description = [
[' n_tokens_title', 'ratio', 'TBD', 'Number of words in the title']
,[' n_tokens_content', 'ratio', 'TBD', 'Number of words in the content']
,[' n_unique_tokens', 'ratio', 'TBD', 'Rate of unique words in the content']
,[' n_non_stop_words', 'ratio', 'TBD', 'Rate of non-stop words in the content']
,[' n_non_stop_unique_tokens', 'ratio', 'TBD', 'Rate of unique non-stop words in the content']
,[' num_hrefs', 'ratio', 'TBD', 'Number of links']
,[' num_self_hrefs', 'ratio', 'TBD', 'Number of links to other articles published by Mashable']
,[' num_imgs', 'ratio', 'TBD', 'Number of images']
,[' num_videos', 'ratio', 'TBD', 'Number of videos']
,[' average_token_length', 'ratio', 'TBD', 'Average length of the words in the content']
,[' num_keywords', 'ratio', 'TBD', 'Number of keywords in the metadata']
,[' data_channel_is_lifestyle', 'nominal', 'TBD', '(Binary) Is data channel Lifestyle?']
,[' data_channel_is_entertainment', 'nominal', 'TBD', '(Binary) Is data channel Entertainment?']
,[' data_channel_is_bus', 'nominal', 'TBD', '(Binary) Is data channel Business?']
,[' data_channel_is_socmed', 'nominal', 'TBD', '(Binary) Is data channel Social Media? ']
,[' data_channel_is_tech', 'nominal', 'TBD', '(Binary) Is data channel Tech?']
,[' data_channel_is_world', 'nominal', 'TBD', '(Binary) Is data channel World?']
,[' kw_min_min', 'ratio', 'TBD', 'Worst keyword (min. shares)']
,[' kw_max_min', 'ratio', 'TBD', 'Worst keyword (max. shares)']
,[' kw_avg_min', 'ratio', 'TBD', 'Worst keyword (avg. shares)']
,[' kw_min_max', 'ratio', 'TBD', 'Best keyword (min. shares)']
,[' kw_max_max', 'ratio', 'TBD', 'Best keyword (max. shares)']
,[' kw_avg_max', 'ratio', 'TBD', 'Best keyword (avg. shares)']
,[' kw_min_avg', 'ratio', 'TBD', 'Avg. keyword (min. shares)']
,[' kw_max_avg', 'ratio', 'TBD', 'Avg. keyword (max. shares)']
,[' kw_avg_avg', 'ratio', 'TBD', 'Avg. keyword (avg. shares)']
,[' self_reference_min_shares', 'ratio', 'TBD', 'Min. shares of referenced articles in Mashable']
,[' self_reference_max_shares', 'ratio', 'TBD', 'Max. shares of referenced articles in Mashable']
,[' self_reference_avg_sharess', 'ratio', 'TBD', 'Avg. shares of referenced articles in Mashable']
,[' weekday_is_monday', 'nominal', 'TBD', '(Binary) Was the article published on a Monday? ']
,[' weekday_is_tuesday', 'nominal', 'TBD', '(Binary) Was the article published on a Tuesday? ']
,[' weekday_is_wednesday', 'nominal', 'TBD', '(Binary) Was the article published on a Wednesday?']
,[' weekday_is_thursday', 'nominal', 'TBD', '(Binary) Was the article published on a Thursday?']
,[' weekday_is_friday', 'nominal', 'TBD', '(Binary) Was the article published on a Friday? ']
,[' weekday_is_saturday', 'nominal', 'TBD', '(Binary) Was the article published on a Saturday? ']
,[' weekday_is_sunday', 'nominal', 'TBD', '(Binary) Was the article published on a Sunday? ']
,[' is_weekend', 'nominal', 'TBD', '(Binary) Was the article published on the weekend?']
,[' LDA_00', 'ratio', 'TBD', 'Closeness to LDA topic 0']
,[' LDA_01', 'ratio', 'TBD', 'Closeness to LDA topic 1']
,[' LDA_02', 'ratio', 'TBD', 'Closeness to LDA topic 2']
,[' LDA_03', 'ratio', 'TBD', 'Closeness to LDA topic 3']
,[' LDA_04', 'ratio', 'TBD', 'Closeness to LDA topic 4 ']
,[' global_subjectivity', 'ratio', 'TBD', 'Text subjectivity']
,[' global_sentiment_polarity', 'ratio', 'TBD', 'Text sentiment polarity ']
,[' global_rate_positive_words', 'ratio', 'TBD', 'Rate of positive words in the content ']
,[' global_rate_negative_words', 'ratio', 'TBD', 'Rate of negative words in the content ']
,[' rate_positive_words', 'ratio', 'TBD', 'Rate of positive words among non-neutral tokens']
,[' rate_negative_words', 'ratio', 'TBD', 'Rate of negative words among non-neutral tokens']
,[' avg_positive_polarity', 'ratio', 'TBD', 'Avg. polarity of positive words ']
,[' min_positive_polarity', 'ratio', 'TBD', 'Min. polarity of positive words ']
,[' max_positive_polarity', 'ratio', 'TBD', 'Max. polarity of positive words ']
,[' avg_negative_polarity', 'ratio', 'TBD', 'Avg. polarity of negative words']
,[' min_negative_polarity', 'ratio', 'TBD', 'Min. polarity of negative words']
,[' max_negative_polarity', 'ratio', 'TBD', 'Max. polarity of negative words']
,[' title_subjectivity', 'ratio', 'TBD', 'Title subjectivity']
,[' title_sentiment_polarity', 'ratio', 'TBD', 'Title polarity']
,[' abs_title_subjectivity', 'ratio', 'TBD', 'Absolute subjectivity level']
,[' abs_title_sentiment_polarity', 'ratio', 'TBD', 'Absolute polarity level']
,[' shares', 'ratio', 'TBD', 'Number of shares (target)']
]
variables = pd.DataFrame(variables_description, columns=['name', 'type', 'scale','description'])
variables = variables.set_index('name')
variables = variables.loc[df.columns]
def output_variables_table(variables):
    rows = ['<tr><th>Variable</th><th>Type</th><th>Scale</th><th>Description</th></tr>']
    for vname, atts in variables.iterrows():
        atts = atts.to_dict()
        # Fill in the scale column if it is still marked TBD
        if atts['scale'] == 'TBD':
            if atts['type'] in ['nominal', 'ordinal']:
                uniques = df[vname].unique()
                uniques = list(uniques.astype(str))
                if len(uniques) < 10:
                    atts['scale'] = '[%s]' % ', '.join(uniques)
                else:
                    atts['scale'] = '[%s]' % (', '.join(uniques[:5]) + ', ... (%d More)' % len(uniques))
            if atts['type'] in ['ratio', 'interval']:
                # (min, mean, median, max) for numeric attributes; %g avoids truncating decimals
                atts['scale'] = '(%g, %g, %g, %g)' % (df[vname].min(), df[vname].mean(), df[vname].median(), df[vname].max())
        row = (vname, atts['type'], atts['scale'], atts['description'])
        rows.append('<tr><td>%s</td><td>%s</td><td>%s</td><td>%s</td></tr>' % row)
    return HTML('<table>%s</table>' % ''.join(rows))
output_variables_table(variables)
Below is a concise summary of the dataset, which lets us quickly check that each attribute has the right type.
df.info()
df.head()
Every attribute in this dataset contains a space (" ") at the beginning of its name; the code below removes that extra space to make handling the dataset a little easier.
# Strip the leading space in variable names
df.columns = df.columns.str.strip()
In the code snippet below, the six channel indicator attributes (data_channel_is_lifestyle, data_channel_is_entertainment, data_channel_is_bus, data_channel_is_socmed, data_channel_is_tech, and data_channel_is_world) have been consolidated into a single new attribute, "Channel".
The consolidation makes sense because the new categorical variable will enable analysis to be grouped by "Channel".
# Combine and make 'channel' with 6 data_channel variables
Lifestyle_df=df[df['data_channel_is_lifestyle']==1].copy()
Lifestyle_df['Channel']='Lifestyle'
Entertainment_df=df[df['data_channel_is_entertainment']==1].copy()
Entertainment_df['Channel']='Entertainment'
Bus_df=df[df['data_channel_is_bus']==1].copy()
Bus_df['Channel']='Bus'
Socmed_df=df[df['data_channel_is_socmed']==1].copy()
Socmed_df['Channel']='Socmedia'
Tech_df=df[df['data_channel_is_tech']==1].copy()
Tech_df['Channel']='Tech'
World_df=df[df['data_channel_is_world']==1].copy()
World_df['Channel']='World'
df=pd.concat([Lifestyle_df,Entertainment_df,Bus_df,Socmed_df,Tech_df,World_df],axis=0)
sum(df['Channel'].value_counts()) # Check if the sample size is the same as original 33,510
In the code snippet below, the seven weekday indicator attributes (weekday_is_monday through weekday_is_sunday) have been consolidated into a single new attribute, "Weekday".
The consolidation makes sense because the new categorical variable will enable analysis to be grouped by "Weekday".
# Combine and make 'Weekday' with 7 weekday variables
Monday_df=df[df['weekday_is_monday']==1].copy()
Monday_df['Weekday']='Monday'
Tuesday_df=df[df['weekday_is_tuesday']==1].copy()
Tuesday_df['Weekday']='Tuesday'
Wednesday_df=df[df['weekday_is_wednesday']==1].copy()
Wednesday_df['Weekday']='Wednesday'
Thursday_df=df[df['weekday_is_thursday']==1].copy()
Thursday_df['Weekday']='Thursday'
Friday_df=df[df['weekday_is_friday']==1].copy()
Friday_df['Weekday']='Friday'
Saturday_df=df[df['weekday_is_saturday']==1].copy()
Saturday_df['Weekday']='Saturday'
Sunday_df=df[df['weekday_is_sunday']==1].copy()
Sunday_df['Weekday']='Sunday'
df=pd.concat([Monday_df,Tuesday_df,Wednesday_df,Thursday_df,Friday_df,Saturday_df,Sunday_df],axis=0)
sum(df['Weekday'].value_counts()) # Check if the sample size is the same as original 33,510
# Check column locations and prepare to drop
df.columns[[11, 12, 13, 14, 15, 16, 29, 30, 31,32, 33, 34, 35, 36 ]]
# Remove previous channel and weekly columns as mentioned above
df.drop(df.columns[[11, 12, 13, 14, 15, 16, 29, 30, 31,32, 33, 34, 35, 36 ]], axis=1, inplace=True)
To check for null values in a dataset, the isnull function from the pandas package can be used.
As shown in the code snippet below, there are no null values in this dataset.
# No Missing values in this dataset
pd.isnull(df).sum()
To check for duplicate records in a dataset, the dataframe.duplicated function from the pandas package can be used.
As shown in the code snippet below, there are no duplicated records in this dataset.
# No duplicated values in this dataset
df[df.duplicated(keep=False)]
To perform a quick check for outliers and examine distributions in a dataset, the dataframe.describe function from the pandas package can be used.
As shown in the code snippet below, some attributes in the dataset have wide, skewed distributions, and a log transformation should be performed to improve data quality.
df.describe().transpose()
For the log transformation, we follow several rules to choose variables: for example, a variable qualifies when its "max" value is more than 10-fold larger than its "50%" (median) value, and a fixed offset (e.g., 2) is added to variables with negative values before transforming. The log transformed dataset is then checked with df.dtypes and df.shape (sample size 33,510 with 47 variables).
As shown in Section 04: Simple Statistics, the min and max values for each attribute in the dataset fall into a reasonable range and do not indicate the presence of any outliers that would impact the effectiveness of the classification model.
# Outliers will be handled by log transformation since the sample size is large (over 33k)
# Find which variables need to make log transformation
df_T = df.describe().T
df_T["log"] = (df_T["max"] > df_T["50%"] * 10) & (df_T["max"] > 1) # max > 10*50% value and max>1
df_T["log+2"] = df_T["log"] & (df_T["min"] < 0) # Need add 2 when min <0
df_T["scale"] = "" # make new variable 'scale' in df_T
df_T.loc[df_T["log"],"scale"] = "log" # show 'log'
df_T.loc[df_T["log+2"],"scale"] = "log+2" # show 'log+2'
df_T[["mean", "min", "50%", "max", "scale"]] # show mean, min, 50%, max, scale
# Log transform 18 variables
df['log_n_tokens_content'] = np.log(df['n_tokens_content'] + 0.1) # Add 0.1 to prevent infinity, same as below
df['log_n_unique_tokens'] = np.log(df['n_unique_tokens'] + 0.1)
df['log_n_non_stop_words'] = np.log(df['n_non_stop_words'] + 0.1)
df['log_n_non_stop_unique_tokens'] = np.log(df['n_non_stop_unique_tokens'] + 0.1)
df['log_num_hrefs'] = np.log(df['num_hrefs'] + 0.1)
df['log_num_self_hrefs'] = np.log(df['num_self_hrefs'] + 0.1)
df['log_num_imgs'] = np.log(df['num_imgs'] + 0.1)
df['log_num_videos'] = np.log(df['num_videos'] + 0.1)
df['log_kw_min_min'] = np.log(df['kw_min_min'] + 2) # Add 2 for 'log+2' to prevent log of non-positive values, same as below
df['log_kw_max_min'] = np.log(df['kw_max_min'] + 0.1)
df['log_kw_avg_min'] = np.log(df['kw_avg_min'] + 2)
df['log_kw_min_max'] = np.log(df['kw_min_max'] + 0.1)
df['log_kw_max_avg'] = np.log(df['kw_max_avg'] + 0.1)
df['log_kw_avg_avg'] = np.log(df['kw_avg_avg'] + 0.1)
df['log_self_reference_min_shares'] = np.log(df['self_reference_min_shares'] + 0.1)
df['log_self_reference_max_shares'] = np.log(df['self_reference_max_shares'] + 0.1)
df['log_self_reference_avg_sharess'] = np.log(df['self_reference_avg_sharess'] + 0.1)
df['log_shares'] = np.log(df['shares'] + 0.1)
# find locations for corresponding untransformed columns
df.columns[[1, 2, 3, 4, 5, 6, 7, 8, 11, 12, 13, 14, 18, 19, 20, 21, 22, 44]]
# Drop the above columns
df.drop(df.columns[[1, 2, 3, 4, 5, 6, 7, 8, 11, 12, 13, 14, 18, 19, 20, 21, 22, 44]], axis=1, inplace=True)
# Check if everything correct so far
df.dtypes
# Check data shape
df.shape
Visualize appropriate statistics (e.g., range, mode, mean, median, variance, counts) for a subset of attributes. Describe anything meaningful you found from this or if you found something potentially interesting.
# Option to view all the available columns, default = 40
pd.options.display.max_columns = 59
# Quick statistic summary of the original data set
dfCopy.describe()
With the goal of classifying whether an article is popular based on the number of shares, it seemed logical to start with that attribute. You will have to scroll right all the way to the end of the table above to find the shares attribute.
The first thing that stands out is the large range, with some articles getting only a single share while the maximum comes in at 690,400 shares. To get a better idea of where the majority of articles fall with respect to shares, we look at the 25th percentile at 930 shares and the 75th percentile at 2,500 shares. Clearly, this looks like a heavily right skewed distribution, which is further confirmed by the mean number of shares (2,929) being higher than the median (1,400), suggesting that the mean is being heavily influenced by outliers on the high end of the scale.
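The mean-above-median pattern is the classic signature of a right skewed distribution. A quick sketch on synthetic share counts (log-normal draws standing in for the real shares column; the parameters are illustrative only):

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed "shares" sample, not the real data
rng = np.random.default_rng(42)
shares = pd.Series(np.round(rng.lognormal(mean=7.2, sigma=0.9, size=1000)))

print(shares.mean() > shares.median())  # True: the long right tail pulls the mean up
print(shares.skew() > 0)                # True: positive skewness coefficient
```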
To illustrate the extent of the right skewness and the need for a log transformation to better interpret the distribution, we have included the histogram and boxplots below. This serves as a visual check for outliers that might not have been dealt with in the Data Quality section which in turn could improve the prediction accuracy for any classification algorithm used for this data set.
# histogram of untransformed shares attribute
plt.subplot(1, 2, 1)
plt.hist(dfCopy[' shares'], bins = 20)
plt.xlabel('Number of shares')
plt.ylabel('Frequency')
# boxplot of untransformed shares attribute
plt.subplot(1, 2, 2)
plt.boxplot(dfCopy[' shares'])
plt.xlabel('Number of shares')
# histogram of log transformed shares attribute
plt.subplot(2, 2, 1)
plt.hist(df['log_shares'], bins = 20)
plt.xlabel('Log Shares')
plt.ylabel('Frequency')
# boxplot of log transformed shares attribute
plt.subplot(2, 2, 2)
plt.boxplot(df['log_shares'])
plt.xlabel('Log Shares')
We already pointed out that the median number of shares is 1,400. The calculation below shows the corresponding median value for the log transformed shares attribute.
df[['log_shares']].median()
We will use this median value as the cutoff between popular and unpopular articles in our classification model. This classification will be stored in a newly created attribute labeled 'popular'.
df['popular'] = np.where(df['log_shares'] >= 7.244299, True, False)
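Because the cutoff is the median of log_shares, the two classes should come out roughly 50/50. A sanity-check sketch on a toy frame standing in for df:

```python
import numpy as np
import pandas as pd

# Toy stand-in for df: log-share values straddling the 7.244299 cutoff
toy = pd.DataFrame({'log_shares': [6.9, 7.0, 7.2, 7.3, 7.5, 8.0]})
toy['popular'] = np.where(toy['log_shares'] >= 7.244299, True, False)

# With a median-based cutoff the two classes should be roughly balanced
print(toy['popular'].value_counts())  # 3 True, 3 False
```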
Looking through the describe summary statistics table, we noticed multiple attributes that have similar labels and seem to represent the same measure. The screenshot below from a Stanford study provides an organized grouping of similar attributes by the aspect of the article that it relates to. For example, the digital media aspect of the article is described by the number of images (num_imgs) and videos (num_videos) attributes.
We decided to discuss the important summary statistics of these aspects rather than go through the individual attributes one by one.

Looking at the statistics summary table, it didn't surprise us that the titles for these articles are kept short. The mean title word count is just over 10 words, with a tight standard deviation of 2 words, because publishers know that today's readers can lose interest before even opening an article with a long title.
Going through the body of the article, there are several specific measures of the rate and type of words used as listed in the screenshot above under the words section.
The attribute that comes to mind as most important with regards to popularity is the number of words in the article (n_tokens_content). The mean word count in this data set is 585, which seems short enough to hold a reader's attention for the full article and thus raise the chance of it being shared. However, a 2013 study conducted by Newswhip Analytics showed that articles in the 500 - 800 word range had the lowest chance of social media success, which is why some news websites such as Quartz decided not to publish articles within that range. Considering the standard deviation of almost 484 words and a max word count of 8,474, there are plenty of articles that fall outside that range. Lastly, the minimum word count of 0 initially got our attention as possible outliers, but upon closer inspection we found that these articles consisted only of images and videos.
After performing an extensive search, we didn't find any literature to compare with regards to rate of non-stop words (n_non_stop_words), rate of unique words (n_unique_tokens), and rate of unique non-stop words (n_non_stop_unique_tokens). Thus, the simple statistics for these attributes didn't seem as pertinent to review.
It would seem that including a lot of links within an article would easily distract a reader, but surprisingly the mean number of links per article in this data set is just over 10. Upon closer examination, the 50th percentile corresponds to 7 links per article, with the maximum coming in at an enormous 304 links, so there is definitely some right skewness in the distribution.
There is an attribute measuring the number of Mashable article links (num_self_hrefs) per article. The mean number of Mashable links comes in at 3 per article, which is lower than the mean of 10 total links discussed above. This seems reasonable, since an article that only includes Mashable links might look like a self-promotion strategy for the website and lower the article's credibility.
The mean number of images (num_imgs) and videos (num_videos) per article is ~ 4 and 1 respectively. This seems reasonable that articles would have at least a few pictures and videos to break up large text paragraphs in an effort to hold the readers' attention.
Looking at the minimum values, it is interesting to see that there are still articles with neither images nor videos included, which we believe would not be popular on social media. On the other end of the spectrum, the max values of 128 images and 75 videos are pretty extreme, and we would imagine most readers would not have the attention span to get through such an article in its entirety.
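Counting how many articles carry no visual media at all is a one-line boolean-mask check; sketched here on a toy frame standing in for dfCopy (the counts are hypothetical):

```python
import pandas as pd

# Toy stand-in: image and video counts for six hypothetical articles
toy = pd.DataFrame({'num_imgs':   [0, 1, 4, 0, 12, 128],
                    'num_videos': [0, 0, 1, 2,  0,  75]})

# Articles with neither images nor videos
no_media = ((toy['num_imgs'] == 0) & (toy['num_videos'] == 0)).sum()
print(no_media)  # 1
```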
From the data description shown below, we can see all variables have the same sample size of 33,510, indicating proper data management with the log transformation. The values (mean, std, min, 25%, 50%, 75% and max) for each attribute in the dataset show approximately normal distributions and reasonable standard deviations. This observation is further confirmed by the histograms for representative variables (see the end of this section). Further statistical analysis can be performed with this dataset.
# Quick statistic summary of the transformed data set
df.describe().transpose()
We can use linear regression to check which variables are significantly associated with the target variable "log_shares".
# Pioneer Linear Regression analysis with numerical variables
class_y = df.log_shares
class_X = df.drop(['log_shares', 'Channel', 'Weekday'], axis=1) # axis = 1 - column
import statsmodels.api as sm
class_X = sm.add_constant(class_X)
ls_model = sm.OLS(class_y.astype(float), class_X.astype(float)).fit()
ls_model.summary()
From the initial analysis there do not appear to be many variables that will be useful in trying to predict the number of shares, and we see a warning about strong multicollinearity. In the presence of multicollinearity, regression estimates are unstable and have high standard errors.
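One standard way to quantify which predictors drive a multicollinearity warning is the variance inflation factor (VIF), which can be read off the diagonal of the inverse correlation matrix. A sketch on synthetic collinear columns, not our actual design matrix:

```python
import numpy as np
import pandas as pd

# Synthetic predictors: x2 is nearly collinear with x1, x3 is independent
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 * 0.95 + rng.normal(scale=0.1, size=200)
x3 = rng.normal(size=200)
X = pd.DataFrame({'x1': x1, 'x2': x2, 'x3': x3})

# VIF for each predictor = corresponding diagonal entry of inv(correlation matrix)
vif = pd.Series(np.diag(np.linalg.inv(X.corr().values)), index=X.columns)
print(vif)  # x1 and x2 come out well above the common cutoff of 10; x3 stays near 1
```

Dropping one variable from each highly collinear pair brings the remaining VIFs back toward 1.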
From the results of a pairplotting exercise (see Section 6) we found the following variables to be interesting and likely significant in our linear regression analysis.
The code below reruns the linear regression analysis with the variable identified from the pair plotting exercise.
# Remove some variables after checking the pairplot; keep 12 predictors and the target 'log_shares'
df_clean = df[['average_token_length', 'num_keywords', 'global_subjectivity','title_sentiment_polarity',
'abs_title_subjectivity', 'log_n_unique_tokens','log_num_hrefs',
'log_num_self_hrefs','log_num_imgs', 'log_num_videos', 'log_self_reference_avg_sharess', 'log_shares']]
# Recheck linear regression analysis.
clean_y = df_clean.log_shares
clean_X = df_clean.drop(['log_shares'], axis=1) # axis = 1 - column
import statsmodels.api as sm
clean_X = sm.add_constant(clean_X)
clean_ls_model = sm.OLS(clean_y.astype(float), clean_X.astype(float)).fit()
clean_ls_model.summary()
From the results we can see the issue of multicollinearity has been resolved and the majority of the variables in the model are significant: 10 of the 12 had p-values lower than 0.05.
The code below generates a histogram for each of the variables in the model. While the histograms do not show ideal distributions for a couple of the attributes, we believe the large number of records (over 33 thousand) helps address any concern about skewness or uneven distribution for those attributes.
# Histogram for most interesting attributes
df_clean.hist(figsize=(12,12))
For this section, rather than going through and visualizing the attributes according to their corresponding aspect as we did in the Simple Statistics section, we decided to focus on the attributes that had obvious trends.
The Weekday attribute showed the most obvious trend, which we noticed right away. The barplot below demonstrates that the number of articles published during the week is greater than on the weekend. Drawing an anecdotal parallel to the typical workweek, this pattern seems to mirror the productivity of white collar office workers, where Monday starts off slowly, picks up mid-week, and eventually drops off by the weekend, which might also be the case for publishers such as Mashable. This barplot doesn't give us insight into the effect that publishing on a certain weekday has on the popularity of an article.
# Create countplot for each weekday
sns.countplot(x='Weekday', data=df)
The side by side plot below breaks out whether the article is popular (True - yellow bar) or not (False - blue bar) and assigns it to the corresponding weekday during which it was published.
The interesting trend that we noticed here is that the proportion of popular articles tends to be higher during the weekend than during the week. A possible explanation for this could be that a typical reader has more leisure time during the weekend to get through several articles and is more likely to share them on social media.
This difference in the ratio of popular to unpopular articles is important to understand as it could influence the predictive accuracy of a classification model.
# Categorical plot splitting out the popular versus unpopular articles by weekday
sns.catplot(x='popular', col="Weekday", col_wrap=5, data=df, kind="count", height=3, aspect=.8)
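The weekend effect can be quantified with a row-normalized cross-tabulation; sketched here on a toy frame with hypothetical (Weekday, popular) pairs rather than the real counts:

```python
import pandas as pd

# Hypothetical articles standing in for df
toy = pd.DataFrame({
    'Weekday': ['Monday', 'Monday', 'Monday', 'Saturday', 'Saturday', 'Sunday'],
    'popular': [False, False, True, True, True, False],
})

# Each row sums to 1: the True column is the share of popular articles on that weekday
print(pd.crosstab(toy['Weekday'], toy['popular'], normalize='index'))
```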
The Channel attribute is another one in which we immediately noticed a strong relationship. The horizontal barchart below shows a direct comparison of the number of articles published for each data channel. We can see that the Lifestyle category has the lowest number of articles published in this data set, while the World category has the highest.
Again, this chart only provides part of the picture. It would seem that publishers such as Mashable are catering to their readers with respect to the type of content they provide. For this reason we need to take a closer look at what proportion of articles are considered popular within each channel.
# Create countplot for each data channel
sns.countplot(y = 'Channel', data = df)
The plot below highlights some interesting differences in the ratio of popular articles for each of data channels. Even though the World channel has the most articles published in this data set, it also has the largest number of unpopular articles which is closely followed by the Entertainment channel. On the other hand, the Tech, Socmedia, and Lifestyle channels have higher proportions of popular articles. This might be a result of the readers having a greater personal interest, such as a hobby, in these channels making them more likely to share the article on social media.
# Categorical plot splitting out the popular versus unpopular articles by data channel
sns.catplot(x='popular', col="Channel", col_wrap=3, data=df, kind="count", height=3, aspect=.8)
We all often hear the phrase "a picture says a thousand words" so it would be interesting to see whether that holds true for this data set. Since we already examined the distribution for the number of images (num_imgs) and videos (num_videos) per article in the Summary Statistics section, we wanted to tie in the channel attribute discussed above.
The first violin plot below shows the log transformed number of images categorized by data channel, with the distribution color coded by whether the article is popular or not. It seems that most of the channels are trimodal, with the exception of the Business channel. Comparing the distributions of the popular versus unpopular articles, there don't appear to be any obvious deviations, which suggests that this attribute might not have a strong influence on popularity.
# Violin plot of popular versus unpopular articles measured by log_num_images categorized by data channel
sns.violinplot(x = "Channel", y = "log_num_imgs", hue = "popular", data = df)
The same violin plot was created for the log transformed number of videos. The main difference when comparing to the images plot above is that most of the channels have a weak bimodal distribution. The Entertainment channel has two strong peaks throughout its distribution, suggesting these articles typically include more videos than other channels. Again, there is no obvious differentiation between the popular and unpopular articles.
# Violin plot of popular versus unpopular articles measured by log_num_videos categorized by data channel
sns.violinplot(x = "Channel", y = "log_num_videos", hue = "popular", data = df)
The natural language processing attributes in this data set require some research to understand what they represent. Even before performing much research into these attributes, we found some obvious trends that can be explained without domain knowledge.
The kernel density estimation plot below shows that the rate of positive words (left-hand side) is almost a mirror image of the rate of negative words (right-hand side) of an article. It would seem that including both of these attributes in a classification model could potentially lower the accuracy. It would be interesting to test this theory when we build our model.
# kde plot for rate of positive words
plt.subplot(2, 2, 1)
sns.kdeplot(np.array(df['rate_positive_words']))
plt.xlabel('Rate of positive words')
plt.ylabel('Frequency')
# kde plot for rate of negative words
plt.subplot(2, 2, 2)
sns.kdeplot(np.array(df['rate_negative_words']))
plt.xlabel('Rate of negative words')
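If the two rates really are mirror images, their correlation should be essentially -1; a sketch with synthetic rates whose negative rate is defined as the exact complement (as the real pair approximately is among non-neutral tokens):

```python
import numpy as np
import pandas as pd

# Synthetic positive-word rates; the negative rate is the complement
rng = np.random.default_rng(1)
pos = pd.Series(rng.uniform(0, 1, size=500))
neg = 1 - pos

corr = pos.corr(neg)
print(corr)  # -1.0: perfectly anti-correlated, so keeping both adds no information
```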
We wanted to understand if the number of words in an article, be it the title or the body, has an obvious relationship with how popular it is.
First, we looked at the distribution of the number of words in the title (n_tokens_title). The plot below displays a normal distribution with the peak frequency right around the 10 word count mark.
# distribution plot of title word count
ax = sns.distplot(dfCopy[' n_tokens_title'], bins = 20, kde = False)
plt.xlabel('Number of words in the title')
plt.ylabel('Frequency')
plt.title('Distribution of Title Word Count')
ax.set_xlim(0, 20)
As we discussed in the Summary Statistics section, the number of words in the body of the article (n_tokens_content) has a heavily right skewed distribution. It seems that the majority of the published articles tend to stay within the 2,000 word count limit but, surprisingly, there are plenty of articles that push past that limit. By applying a log transformation to this attribute, we were able to bring it to an approximately normal distribution with a small left tail. The diagram below shows the difference between the untransformed (left-hand side) and the log transformed distribution (right-hand side).
By looking through these distributions, we are able to make a better judgement on which variables we will include in our classification model.
# side by side distribution plots of untransformed and log transformed article word count
fig, ax = plt.subplots(1,2)
sns.distplot(dfCopy[' n_tokens_content'], kde = False, ax = ax[0])
sns.distplot(df['log_n_tokens_content'], kde = False, ax = ax[1])
fig.show()
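The log-transformed column used above can be created with `np.log1p` (a sketch on a hypothetical mini-frame standing in for `dfCopy`; `log1p` is our choice because it keeps zero-word articles finite, where plain `np.log` would produce -inf):

```python
import numpy as np
import pandas as pd

# Hypothetical mini-frame standing in for dfCopy; the real column name is
# ' n_tokens_content' (note the leading space in the raw CSV headers).
demo = pd.DataFrame({' n_tokens_content': [0, 150, 2000, 8000]})

# log1p(x) = log(x + 1), so zero-length articles map to 0 instead of -inf
demo['log_n_tokens_content'] = np.log1p(demo[' n_tokens_content'])
print(demo['log_n_tokens_content'].round(2).tolist())
```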
Visualize relationships between attributes: Look at the attributes via scatter plots, correlation, cross-tabulation, group-wise averages, etc. as appropriate. Explain any interesting relationships.*
In the scatterplot matrix for 12 representative variables shown below, the diagonal axes plot the univariate distribution of each column's variable (approximately normal), while the grid of off-diagonal axes shares its y-axis across each row and its x-axis across each column. The distributions are further grouped by "Weekday" and by "Channel".
It is worth mentioning that a pairplot of all the numerical variables in this dataset (not shown) helped us identify the variables with multicollinearity (as mentioned above).
The variables ('average_token_length', 'num_keywords', 'global_subjectivity','title_sentiment_polarity', 'abs_title_subjectivity', 'log_n_tokens_content', 'log_n_unique_tokens','log_num_hrefs', 'log_num_self_hrefs','log_num_imgs', 'log_num_videos', 'log_self_reference_avg_sharess') will be used for further analysis.
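One programmatic way to flag multicollinear pairs, rather than eyeballing a full pairplot, is to scan the upper triangle of the correlation matrix for coefficients beyond a threshold. This is a sketch on a synthetic frame; the 0.8 cutoff and the helper name `high_corr_pairs` are our own choices, not part of the dataset:

```python
import numpy as np
import pandas as pd

def high_corr_pairs(frame, threshold=0.8):
    """Return (col_a, col_b, r) for every pair with |r| above threshold."""
    corr = frame.corr().abs()
    # keep only the upper triangle so each pair is reported once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    return [(a, b, round(corr.loc[a, b], 2))
            for a in upper.columns for b in upper.columns
            if pd.notna(upper.loc[a, b]) and upper.loc[a, b] > threshold]

# Synthetic example: y is a noisy copy of x, z is independent of both
rng = np.random.default_rng(1)
x = rng.normal(size=500)
demo = pd.DataFrame({'x': x,
                     'y': x + rng.normal(scale=0.1, size=500),
                     'z': rng.normal(size=500)})
print(high_corr_pairs(demo))  # only the ('x', 'y') pair should be flagged
```

Applied to the real `df`, this gives a short list of candidate columns to drop before modeling.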
# Pairplot for log transformed variables, as grouped by Weekday
sns.pairplot(df, vars=['average_token_length', 'num_keywords', 'global_subjectivity','title_sentiment_polarity', 'abs_title_subjectivity', 'log_n_tokens_content', 'log_n_unique_tokens','log_num_hrefs',
'log_num_self_hrefs','log_num_imgs', 'log_num_videos', 'log_self_reference_avg_sharess'], hue="Weekday", palette="husl", height=2)
# Pairplot for log transformed variables, as grouped by Channel
sns.pairplot(df, vars=['average_token_length', 'num_keywords', 'global_subjectivity','title_sentiment_polarity','abs_title_subjectivity', 'log_n_tokens_content', 'log_n_unique_tokens','log_num_hrefs','log_num_self_hrefs','log_num_imgs', 'log_num_videos', 'log_self_reference_avg_sharess'], hue="Channel", palette="husl", height=2)
# Boxplot and distribution plot of log_n_non_stop_words
f, (ax0,ax2) = plt.subplots(nrows=2, ncols=1, figsize=[15, 7])
sns.boxplot(df['log_n_non_stop_words'], ax=ax0, color="#34495e").set_title('Distribution of log_n_non_stop_words');
sns.distplot(df['log_n_non_stop_words'], ax=ax2, color="#34495e");
The code below produces a correlation heatmap for the following four variables:
words_cols = df[['log_n_non_stop_words','log_n_non_stop_unique_tokens','log_n_unique_tokens', 'log_shares']].copy()
words_cols.describe()
print(words_cols.corr())
corrs = words_cols.corr()
fig, ax = plt.subplots(figsize=(10, 10))
sns.heatmap(corrs, ax=ax)
plt.title("Words aspect features correlation map", fontsize=20)
g = sns.PairGrid(words_cols)
g.map_diag(plt.hist)
g.map_offdiag(plt.scatter);
words_cols = dfCopy[[' kw_min_min',' kw_max_min',' kw_avg_min', ' kw_min_max',' kw_max_max', ' kw_avg_max',' kw_min_avg',' kw_max_avg', ' kw_avg_avg']].copy()
words_cols.describe()
print(words_cols.corr())
words_cols = dfCopy[[' self_reference_min_shares',' self_reference_max_shares',' self_reference_avg_sharess']].copy()
words_cols.describe()
print(words_cols.corr())
g = sns.PairGrid(words_cols)
g.map_diag(plt.hist)
g.map_offdiag(plt.scatter);
words_cols = dfCopy[[' avg_positive_polarity', ' min_positive_polarity', ' max_positive_polarity', ' avg_negative_polarity', ' min_negative_polarity', ' max_negative_polarity']].copy()
print(words_cols.corr())
corrs = words_cols.corr()
fig, ax = plt.subplots(figsize=(10, 10))
sns.heatmap(corrs, ax=ax)
plt.title("Polarity features correlation map", fontsize=20)
Are there other features that could be added to the data or created from existing features? Which ones?
Heatmap displays numeric tabular data where the cells are colored depending upon the contained value.
From the color scale, positively correlated variables (e.g. 'log_n_unique_tokens' and 'log_n_non_stop_unique_tokens') are shown in purple, while negatively correlated variables (e.g. 'kw_avg_max' and 'log_kw_min_min') are shown in blue.
# Plot the correlation matrix using seaborn
sns.set(style="darkgrid") # one of the many styles to plot using
df_heatmap = plt.subplots(figsize=(10, 10))
sns.heatmap(df.corr(), cmap="BuPu")
Compared to the heatmap shown above, a clustermap plots the matrix using hierarchical clustering to arrange the rows and columns. Here, positive correlations are shown in green and negative correlations in blue, and closely related variables are clustered next to each other.
Variables that cluster together this way are natural candidates for consolidation or removal in later modeling.
# Clustermap is to plot a matrix using hierachical clustering to arrange the rows and columns.
numeric = [c for i,c in enumerate(df.columns) if df.dtypes[i] in [np.float64, np.int64]]
len(numeric)
cmap = sns.diverging_palette(255, 133, l=60, n=7, as_cmap=True, center="dark")
sns.clustermap(df[numeric].corr(), figsize=(14, 14), cmap=cmap);
# Cut log_shares into 2 groups (0, 1)
df['log_shares_cut'] = pd.qcut(df['log_shares'], 2, labels = ('unpopular', 'popular'))
# Drop the original 'log_shares' column now that it is encoded in 'log_shares_cut'
df.drop(columns=['log_shares'], inplace=True)
# Samples for pairplot as group by the log_share_cut (0, 1)
sns.pairplot(df, vars = ['average_token_length', 'num_keywords', 'global_subjectivity','title_sentiment_polarity',
'abs_title_subjectivity', 'log_n_tokens_content', 'log_n_unique_tokens','log_num_hrefs',
'log_num_self_hrefs','log_num_imgs', 'log_num_videos', 'log_self_reference_avg_sharess'], hue = "log_shares_cut", palette="husl", height=2)
# Pick log transformed variables, transform and prepare for PCA
from sklearn.preprocessing import StandardScaler
features = ['average_token_length', 'num_keywords', 'global_subjectivity','title_sentiment_polarity',
'abs_title_subjectivity', 'log_n_tokens_content', 'log_n_unique_tokens','log_num_hrefs',
'log_num_self_hrefs','log_num_imgs', 'log_num_videos', 'log_self_reference_avg_sharess']
# Separating out the features
x = df.loc[:, features].values
# Separating out the target
y = df.loc[:,['log_shares_cut']].values
# Standardizing the features
x = StandardScaler().fit_transform(x)
# Try PCA
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
principalComponents = pca.fit_transform(x)
principalDf = pd.DataFrame(data = principalComponents,
                           columns = ['principal component 1', 'principal component 2'])
# Concat two component and prepare to plot
finalDf = pd.concat([principalDf, df[['log_shares_cut']]], axis = 1)
finalDf.head(10)
# Plot 2 component PCA
fig = plt.figure(figsize = (6,6))
ax = fig.add_subplot(1,1,1)
ax.set_xlabel('Principal Component 1', fontsize = 15)
ax.set_ylabel('Principal Component 2', fontsize = 15)
ax.set_title('2 component PCA', fontsize = 20)
log_shares_cuts = ['unpopular', 'popular'] # 0 = unpopular, 1 = popular
colors = ['r', 'b']
for log_shares_cut, color in zip(log_shares_cuts, colors):
indicesToKeep = finalDf['log_shares_cut'] == log_shares_cut
ax.scatter(finalDf.loc[indicesToKeep, 'principal component 1']
, finalDf.loc[indicesToKeep, 'principal component 2']
, c = color
, s = 15)
ax.legend(log_shares_cuts)
ax.grid()
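Before reading too much into the 2-component scatter, it is worth checking how much variance those two components actually capture, via `pca.explained_variance_ratio_`. This is a sketch on synthetic standardized data standing in for the 12 features above; with weakly related features, the first two components often explain well under half the total variance:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 12 standardized features used above
rng = np.random.default_rng(3)
X = StandardScaler().fit_transform(rng.normal(size=(1000, 12)))

pca = PCA(n_components=2)
pca.fit(X)
print(pca.explained_variance_ratio_)        # per-component share of variance
print(pca.explained_variance_ratio_.sum())  # total captured by the 2-D plot
```

A low total here would mean the 2-D scatter hides most of the structure, which may explain why the two classes overlap in the plot.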
Linear Discriminant Analysis (LDA) is an alternative method that finds boundaries around class clusters with better separability. We will attempt this in Lab 2.
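As a preview, here is a minimal LDA sketch using scikit-learn's `LinearDiscriminantAnalysis` on synthetic two-class data (the real model in Lab 2 would use the standardized features and the `log_shares_cut` labels above):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Two synthetic classes with shifted means, standing in for popular/unpopular
rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0.0, 1.0, size=(300, 4)),
               rng.normal(1.0, 1.0, size=(300, 4))])
y = np.array([0] * 300 + [1] * 300)

# With 2 classes, LDA projects onto a single discriminant axis
lda = LinearDiscriminantAnalysis(n_components=1)
z = lda.fit_transform(X, y)
print(z.shape)            # one discriminant coordinate per sample
print(lda.score(X, y))    # training accuracy on the partially separable classes
```

Unlike PCA, LDA is supervised: it picks the projection that maximizes between-class separation rather than overall variance, which is why it can give cleaner class boundaries in the projected space.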
Online News Popularity Dataset from UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/Online+News+Popularity
Kelwin Fernandes, Pedro Vinagre, and Paulo Cortez. A Proactive Intelligent Decision Support System for Predicting the Popularity of Online News (https://pdfs.semanticscholar.org/ad7f/3da7a5d6a1e18cc5a176f18f52687b912fea.pdf)